An FPGA Drop-In Replacement for Universal Matrix-Vector Multiplication

Authors

Eric S. Chung

John D. Davis

Srinidhi Kestur

Abstract

We present the design and implementation of a universal, single-bitstream library for accelerating matrix-vector multiplication using FPGAs. Our library handles multiple matrix encodings, ranging from dense to multiple sparse formats. A key novelty in our approach is the introduction of a hardware-optimized sparse matrix representation called Compressed Variable-Length Bit Vector (CVBV), which reduces storage and bandwidth requirements by up to 43% (25% on average) compared to compressed sparse row (CSR) across all the matrices from the University of Florida Sparse Matrix Collection. Our hardware incorporates a runtime-programmable decoder that performs on-the-fly decoding of various formats such as Dense, COO, CSR, DIA, and ELL. The flexibility and scalability of our design are demonstrated across two FPGA platforms: (1) the BEE3 (Virtex-5 LX155T with 16GB of DRAM) and (2) the ML605 (Virtex-6 LX240T with 2GB of DRAM). For dense matrices, our approach scales to large data sets with over 1 billion elements and achieves robust performance independent of the matrix aspect ratio. For sparse matrices, our approach using a compressed representation reduces the overall bandwidth while also achieving comparable efficiency relative to state-of-the-art approaches. Note: this work was published in FCCM’12 [1].

Keywords: FPGA; dense matrix; sparse matrix; spMV; reconfigurable computing
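For readers unfamiliar with the CSR baseline that the abstract compares against, the following is a minimal software sketch of CSR encoding and CSR matrix-vector multiplication. It is an illustration only: it does not model the paper's hardware decoder or the CVBV format, whose details are not given here, and the function names (`dense_to_csr`, `csr_matvec`, `dense_matvec`) are hypothetical.

```python
def dense_to_csr(dense):
    """Encode a dense matrix (list of rows) as CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)   # non-zero value
                col_idx.append(j)  # its column index
        row_ptr.append(len(values))  # end offset of this row
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x with A given in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        # Only the stored non-zeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def dense_matvec(dense, x):
    """Reference dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in dense]


A = [[4, 0, 0, 1],
     [0, 0, 2, 0],
     [0, 3, 0, 0]]
x = [1, 2, 3, 4]

values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, x)
print(values, col_idx, row_ptr)  # [4, 1, 2, 3] [0, 3, 2, 1] [0, 2, 3, 4]
print(y)                         # [8, 6, 6]
```

In CSR, every non-zero carries a full column index and every row carries a pointer; this index metadata is the kind of storage and bandwidth overhead that a more compact encoding would aim to reduce.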


Similar resources

On Sparse Matrix-Vector Multiplication with FPGA-Based System

In this paper we report on our experimentation with the use of an FPGA-based system to solve the irregular computation problem of evaluating the matrix-vector product when the matrix A is sparse. The main features of our matrix-vector multiplication algorithm are (i) an organization of the operations to suit the FPGA-based system's ability to process a stream of data, and (ii) the use of distributed arithmetic technique t...


An Efficient LUT Design on FPGA for Memory-Based Multiplication

An efficient Lookup Table (LUT) design for a memory-based multiplier is proposed. This multiplier is well suited to DSP computation where one of the multiplier inputs, the filter coefficient, is fixed. In this design, all possible product terms of the input multiplicand with the fixed coefficient are stored directly in memory. In contrast to an earlier proposition Odd Multiple Storage ...


High performance sparse matrix-vector multiplication on FPGA

This paper presents the design and implementation of a high-performance sparse matrix-vector multiplication (SpMV) on a field-programmable gate array (FPGA). By proposing a new storage format to compress the indexes of non-zero elements by exploiting the substructure of the sparse matrix, our SpMV implementation on a reconfigurable computing platform with a multi-channel memory subsystem is capabl...


Article_A.Cariow_G.Cariowa_Final_proof

In this communication we present a hardware-oriented algorithm for calculating a constant matrix-vector product when all elements of the vector and matrix are complex numbers. The main idea behind our algorithm is to combine the advantages of Winograd’s inner product formula with Gauss's trick for complex number multiplication. The proposed algorithm versus the naïve method of analogous...
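The Gauss's trick mentioned in the abstract above computes a complex product with three real multiplications instead of four. The sketch below shows only that well-known identity; it is not the cited paper's full algorithm (which also involves Winograd's inner product formula), and the function name `gauss_complex_mul` is hypothetical.

```python
def gauss_complex_mul(a, b, c, d):
    """Return (real, imag) of (a + bi) * (c + di) using 3 real multiplications."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    # k1 - k3 = ac - bd (real part); k1 + k2 = ad + bc (imaginary part)
    return k1 - k3, k1 + k2


real, imag = gauss_complex_mul(1, 2, 3, 4)
print(real, imag)  # -5 10

# Cross-check against Python's built-in complex arithmetic.
reference = complex(1, 2) * complex(3, 4)
print(reference)   # (-5+10j)
```

When the second operand is a constant, the terms `d - c` and `c + d` can be computed once in advance, which is one reason the trick is attractive for constant-matrix products.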
